Transformer notes


Disorganized rambling about Transformers

1 Intro

1.4 Summary

Missing the key advantage of transformers over RNNs: stackability. Each layer has the same shaped input and output. Disadvantage: limited context window.

Key innovation was the positional encoding of embedding vectors and the attention mechanism (query, key, and value vector matching/connecting).

Drawbacks of LLM transformers and zero shot and few shot are repeated twice, almost verbatim, and are vague. They do not have examples or provide insights.

Zero and few shot learning means something different for transformers and LLMs than it does in machine learning in general. For LLMs, ZSL means the ability to answer factual questions and word problems that the model hasn’t seen before (but studies of LLMs with test questions created after the LLMs were trained have shown this is an illusion, just memorization, in most cases). FSL is the ability to see two new, similar word problem or analogy question-answer pairs and generalize to be able to answer similar questions. ZSL and FSL are essentially just the conversational interface pretending to do machine learning.

Zero shot: 1+1? Black is to white as left is to ?
Few shot: a 4 ft thingamajig weighs 8 pounds, a 5 ft thingamajig weighs 10 pounds, how much does a 6 ft thingamajig weigh?
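The difference between the two prompt styles above can be sketched in a few lines. This is only an illustration of how the prompts are assembled; `build_prompt` and the "Q:/A:" layout are assumptions of this sketch, and no model is called.

```python
# Hypothetical helper: builds a prompt string only; no language model is involved.
def build_prompt(question, examples=()):
    """Prepend zero or more (question, answer) pairs to a new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Zero shot: the question alone, with no worked examples.
zero_shot = build_prompt("Black is to white as left is to ?")

# Few shot: two solved examples precede the new question.
few_shot = build_prompt(
    "How much does a 6 ft thingamajig weigh?",
    examples=[
        ("How much does a 4 ft thingamajig weigh?", "8 pounds"),
        ("How much does a 5 ft thingamajig weigh?", "10 pounds"),
    ],
)

print(zero_shot)
print("---")
print(few_shot)
```

Nothing is learned in either case: the "shots" are just extra text in the context window, which is the point made above about the conversational interface.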

A disadvantage of the transformer relative to the RNN is that it requires a fixed input text length, similar to a CNN. An RNN can continue processing text indefinitely. As in reinforcement learning, an RNN could be used to play a video game directly without limiting the memory of the RNN to just the recent history with a rolling window.

2 Deeper look

2.1 seq to seq

  • The encoder-decoder architecture is praised without describing what is praiseworthy about it, what it is, or how it relates to seq to seq models, generative models, or transformers.
  • GPT (the transformer architecture for almost all modern generative language models, large and small) is a decoder-only architecture, NOT encoder-decoder. All the claims that the encoder-decoder architecture is vital to transformer success, performance, etc. are untrue, or at the very least misleading. The essence of transformers is that you do not need a different kind of network for any of the layers; they transform inputs to outputs of the same shape. There is no neck-down to a lower dimensional space. The second to last output layer IS the encoding vector and also the decoding vector.
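The "same shape in, same shape out" point can be shown with a toy stack. This is a minimal sketch: `make_layer` is a hypothetical stand-in (a random linear map plus a skip connection), not a real transformer layer, and exists only to show that shape-preserving layers can be stacked to any depth.

```python
import random

def make_layer(d_model, seed):
    """Return a toy shape-preserving layer (NOT a real transformer layer)."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(d_model)]

    def layer(seq):
        out = []
        for vec in seq:
            mixed = [sum(w[i][j] * vec[j] for j in range(d_model)) for i in range(d_model)]
            # Skip connection: the input is added back, so the width stays d_model.
            out.append([v + m for v, m in zip(vec, mixed)])
        return out

    return layer

def shape(seq):
    return (len(seq), len(seq[0]))

d_model, seq_len = 4, 3
x = [[float(i + j) for j in range(d_model)] for i in range(seq_len)]
layers = [make_layer(d_model, seed) for seed in range(6)]

h = x
for layer in layers:
    h = layer(h)
    # Every intermediate output could serve as an encoding of the sequence.
    assert shape(h) == shape(x)

print(shape(x), shape(h))
```

Because no layer narrows the representation, adding or removing layers requires no other change to the network.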

2.2 encoder

“Important” and “vital” are misleading and untrue. Elsewhere, descriptions seem too vague to give the reader any useful insight into the inner workings of transformers: “seamless”, “residual bypass”, “stabilize the training”, “improves performance”. The consistent output shape across all layers allows any layer’s output to be used as an encoding or embedding tensor. The suggestion that “residual bypass” is what “improves performance” seems misleading, as the input is not bypassing the transformer layer; it is being transformed by each layer in sequence. But I may not understand the Transformer architecture any better than the authors.

Adding the original input to the attention mechanism is shifting the meaning (semantics) of the word embeddings. E.g. adding adjectives to nouns to refine the “thought”/concept that token represents, resolving synonym ambiguity. See 3b1b youtube video on GPT to understand how context-sensitive embeddings are nudged by the attention mechanism to refine the understanding of the input sequence and what the generated output token should be.

I think that tokens can be overwritten in order to accomplish text expansion during encoding by Bidirectional encoder networks and decoding by GPT architectures. I.e. I do not think the embedding/encoding sequence has the same number of padding tokens as the input sequence of embeddings. (TODO HL: check in pytorch implementation of a small pretrained LM)

Fig 2.3

Excellent diagram. Could be improved with fewer blocks representing fewer embeddings, and if the tokens for those inputs were shown at the bottom or top of the diagram for each rectangular box representing embeddings. Adding an important context word at the beginning would help show the problem with exponentially decaying weights/gradients/attention, e.g. this token sequence would help illustrate your point better: “[The] [Transform] [er] [is] [the] [most] [signif] [icant] [NLP] [advance] [ment] [since] [the] [CNN]”. An explanation could mention the acronym CNN having many meanings, including “cable news network” or “convolutional neural networks” for computer vision or AI or NLP. The context tokens Transform + NLP would refine NLP from “neuro linguistic programming” to “natural language processing”, which another layer might then use to refine CNN further to be talking about CNNs and Transformers for NLP.

Before a diagram or implementation of embeddings is shown, need to explain or diagram briefly what they are, like Grant Sanderson (3b1b) did in YouTube vid on GPT

2.2.2

The “Positional encoding” discussion could show how the position and order of words is important to the meaning of a sentence, e.g. “Mark made his mark as a renown ed mark s man”. The transformer must have distinct embeddings for each of the 3 “mark”s in order to combine them correctly with other context words and resolve ambiguity overall for the entire sentence.

  • The same words have different meaning if you change their order.
  • Different words can have the same meaning if you change their order.
  • Different words can change their meanings if you order them the same but there are different context words at different relative positions.
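The three "mark" tokens can be made distinct with a short sketch. It uses the sinusoidal encoding from the original Transformer paper; the book's own implementation may differ (many models learn position embeddings instead), and `toy_embedding` is a made-up lookup in which every occurrence of a token starts with the same vector.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal position vector: sin on even indices, cos on odd indices."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

tokens = "mark made his mark as a renown ed mark s man".split()
d_model = 8

# Hypothetical toy embedding: identical tokens share one vector.
toy_embedding = {tok: [float(len(tok))] * d_model for tok in set(tokens)}

# Model input = token embedding + position encoding.
inputs = [
    [e + p for e, p in zip(toy_embedding[tok], positional_encoding(pos, d_model))]
    for pos, tok in enumerate(tokens)
]

mark_positions = [i for i, tok in enumerate(tokens) if tok == "mark"]
print(mark_positions)
for i in mark_positions:
    print(i, [round(v, 3) for v in inputs[i]])
```

The token embedding is identical for all three occurrences, so any difference between the printed vectors comes from position alone.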

scaled dot product

“Effectively controlled, mitigating”: Can mention the central limit theorem and the fact that the root sum of squared errors (RSSE) of N normally distributed residuals is about sqrt(N) times the RMSE (root mean square of the residuals), so multiplying by 1/sqrt(N) brings it back to the scale of a single residual. The normalizing factor of 1/sqrt(N) comes out of the math of taking the mean to compute RMSE rather than RSS, but delaying the 1/N computation until the very end, to save on compute.
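A quick simulation can make the scaling argument concrete. This sketch assumes query and key components are independent standard normal values (an idealization, not a property of trained models); under that assumption the raw dot product has a standard deviation near sqrt(d), and dividing by sqrt(d) brings it back to about 1.

```python
import math
import random
import statistics

def dot_product_std(d, trials=2000, seed=0):
    """Empirical standard deviation of q.k for random d-dimensional q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d)]
        k = [rng.gauss(0, 1) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

for d in (16, 64, 256):
    raw = dot_product_std(d)
    # Columns: dimension, raw spread, spread after the 1/sqrt(d) scaling.
    print(d, round(raw, 2), round(raw / math.sqrt(d), 2))
```

Without the scaling, the spread of the scores grows with the embedding width, which pushes the softmax toward near one-hot outputs.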

“Critical component of attention mechanism”: Should mention that the query, key, and value vectors are used to add context to embeddings, resolving token ambiguity and refining the semantics of the embedding sequence in order to most accurately predict the next token (see my comment on fig 2.3).

Retain relevant context. Collect and move the semantics of one token to that of another, combining the meanings of tokens correctly.
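A single-query attention step shows how meaning is "collected and moved" from context tokens. This is a minimal sketch with made-up 2-d vectors for the tokens "red", "ball", and "the"; real models learn the projections that produce queries, keys, and values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)  # continuous weights, one per position
    d_v = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]
    return weights, output

# Hypothetical vectors for the tokens "red", "ball", "the".
keys = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2]]
values = [[5.0, 0.0], [0.0, 5.0], [1.0, 1.0]]
query_from_ball = [2.0, 0.0]  # roughly: "is there an adjective that modifies me?"

weights, context = attention(query_from_ball, keys, values)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The weights are graded rather than on/off: every position contributes something, and the best key match ("red") contributes the most to the vector that is added back to the embedding for "ball".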

2.5 we are splitting the … into multiple heads. One head might handle adjective modifiers before nouns, another might handle verb phrases, and another might handle verb modification of their nouns (subject-object-verb combinations).

Less sensitive to parameter initialization. “Better learn longer term relationships than rnn”: The attention mechanism itself and the finite window width are what maintain long-term relationships. Position encoding allows a transformer to isolate the meaning of one token at one position from the same word or phrase at another position, such as “pay attention to the attention mechanism” or “large language models have large capacity and must be trained on large amounts of data using large compute clusters.”

3Blue1Brown

query and key vector example is brilliant as is the heads of multihead attention (adjectives preceding is one query, another attention head might be succeeding verbs)

Todo: Should try to visualize this on a translation transformer. Find out what kinds of queries attention heads accomplish.

Instead of encoder decoder, consider embedding and unembedding

Below figure 2.16: “enabling the model to determine which positions are crucial for a given context”. It’s not binary but continuous. Should be: “The relative importance or weight of each position in the meaning of the sequence for the particular aspect of the document contained in that attention head.”

ch 3

The text preceding Equation 3.3.3, “…, where y is the initial input sequence.”, could be improved with “…, where y is the original uncorrupted token sequence.”

"In contrast to the models mentioned above…

“Abstractive summarization is generally considered to be more difficult than extractive summarization.” “In its own words”: like the “new sentences” wording better. “Higher compression ratio”

The reader just wants to know which is better. Did humans abstractively summarize the arXiv papers and CNN/Daily Mail dataset or not? What about multiple correct translations or summaries? How can you use an inferior algorithm to measure the quality of a better algorithm?

I like the text rank demo on the set of sentences about ml and dl.

3.1 pointer generator

BLEU and ROUGE are not metrics for model performance. They are metrics for text similarity that must be combined with well designed labeled datasets and an algorithm for dealing with multiple correct alternative responses to create a model metric.
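The gap between a text-similarity score and a model metric can be sketched with a simplified unigram-overlap F1 (a ROUGE-1-style score, not the official implementation). The reference summaries are made up, and `multi_reference_score` (take the best match over all acceptable references) is one assumed policy among several possible ones.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1 style)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def multi_reference_score(candidate, references):
    """Assumed policy: credit the candidate with its best-matching reference."""
    return max(unigram_f1(candidate, ref) for ref in references)

# Made-up alternative summaries of the same story.
references = [
    "the boy wizard defeats the dark lord",
    "harry learns magic and beats his enemy",
]
candidate = "harry beats his enemy"

single = unigram_f1(candidate, references[0])
multi = multi_reference_score(candidate, references)
print(round(single, 3), round(multi, 3))
```

Scored against only the first reference, a reasonable summary gets zero credit because it shares no words with it; the similarity function did not change, only the dataset and the aggregation rule did.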

3.5?

The Harry Potter summary illustrates that summarization is an art, not a science. There is no single ground truth. A good summarization dataset will provide multiple alternative summaries depending on the theory of mind of the summarizer for the intended reader: a financial analyst (MSNBC reader), a sociologist, a psychologist, a parent, a politician, a father, a mother, a rebellious teenager (MTV, Rolling Stone). One reader may prefer summaries that another hates.

Summarization should be the last task evaluated. Translation should be first. Even for translation there can be style choices that require multiple correct answers. Autoencoder tasks may be better first, with metrics focused on compression ratio and model efficiency/capacity and generalization to unseen data.

~“Increasing model size results in better performance suggesting scaling” is not true. Scale causes memorization. Scale makes it more likely to fool a human judge but also more likely to overfit and fail to generalize.

3.6

Loved the bart summarizer fine tuning example and use of Huggingface datasets and training wrappers.

3

Overall, the chapter needs to consider metric alternatives that do not rely on ngram similarity. Semantic similarity is what matters for abstractive summarization. And paraphrasing models can be used to augment existing single-target summarization and translation datasets. This can lead to a discussion of the fact that summarizers and translators of technical texts need reasoning and domain expertise that cannot be captured by a language model alone. And this gap will not be noticed without robust evaluation (adversarially designed edge cases a la active learning with tools like dynabench).

“alpha is a special parameter”: analogous to temperature. “usually underrepresented”

“Encouraging alignment of English and French”

4 types of chatbots

All nongenerative chatbots are corpus-based. It’s just a matter of whether you use a corpus to train a language model, or you search it and retrieve/extract responses directly from the corpus, or your bot generates responses from what it retrieves from its internal representation of the corpus (stochastic-parrot-style), or you do the retrieval explicitly before generation (RAG) or extraction (extractive summarization + retrieval). Even Eliza uses a corpus: a labeled corpus where each “document” is a word or compound, a dictionary of English language emotion words labeled with their sentiment. It was the first practical application of sentiment analysis. “Corpus-based” should be replaced with “retrieval-based”. “Machine learning” should be replaced with “hybrid architectures”. “Transformer-based” should be replaced with “generative chatbots”. A 5th type to consider would be symbolic reasoning chatbots such as Wolfram Alpha and Watson.

And ChatGPT should be identified as a hybrid chatbot, because it employs rules (to implement the business logic and terms of use of OpenAI) and employs retrieval (web search) in Bing and Google and phind and you.com. OpenAI also uses retrieval to provide references and embed advertisements. And all modern chatbots employ various machine learning approaches, from intent recognition to reinforcement learning with human feedback (TPO, PPO) or even direct LLM feedback (DPO).

5

Sentiment analysis is binary classification where the positive class is positive sentiment. Some sentiment analysis adds the subtlety of 2d regression (valence and positivity), and perhaps a 3d version. It is important to distinguish multiclass from multilabel classification. Multiclass just requires a softmax output. It’s equivalent to token prediction, where the number of classes is the size of your vocabulary.

Toc: Chapters should be reversed: 1, 2 attention, 5 class, 4 trans, 3 sum.

“Sentiment analysis” is not a classification model type, it is a class label type. And topic analysis is not multiclass, it is multilabel: topics are not mutually exclusive. Neither is sentiment.

Better to talk about: binary class, multiclass (mutually exclusive labels), multilabel (tagging).

  1. Binary positive sentiment classifier/regressor (even better, is toxic non toxic, so that underlying continuous regression is appreciated)
  2. Binary spam not spam (even better, bot or not bot) - should product designers heed the feedback or not
  3. Multiclass sentiment positive negative spam/irrelevant
  4. Multiclass feature suggestion or bug suggestion (common task for triage of software eng process, product design)
  5. Multilabel (both spam and negative, or both bot and positive, or both feature and bug discussed in same comment)

For next word prediction, multilabel is like ambiguity resolution or zeugma simplification. Advanced transformers use beam search (multilabel instead of multiclass next word prediction).

  1. Binary topic classification (sarcastic or not, joke or not)
  2. 2d sentiment positive neg valence multilabel
  3. Topic/sentiment (multiclass, with topics other than just sentiment)
  4. Multilabel one vs rest is conceptually identical to having multiple independent binary classifiers
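The multiclass versus multilabel distinction comes down to the output function. In this minimal sketch the label names and logits are made up for a single hypothetical user comment; softmax forces the labels to compete, while independent sigmoids behave like separate binary classifiers.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

labels = ["spam", "negative", "feature", "bug"]
logits = [2.0, 1.5, -1.0, -2.0]  # hypothetical scores for one comment

# Multiclass: probabilities sum to 1, so exactly one label wins.
multiclass = dict(zip(labels, softmax(logits)))
predicted_class = max(multiclass, key=multiclass.get)

# Multilabel: each label is an independent yes/no decision.
multilabel = {lab: sigmoid(z) for lab, z in zip(labels, logits)}
predicted_tags = [lab for lab, p in multilabel.items() if p >= 0.5]

print(predicted_class, predicted_tags)
```

With the same logits, the multiclass head has to pick one label, while the multilabel head can tag the comment as both spam and negative.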

That makes it easy to progress to multilabel multidimensional regression and classification. It’s a common misunderstanding that classification labels cannot be used for regression or that quantifying them with continuous values doesn’t help.

Hashtag learning is an approachable way to discuss multilabel classification.

naive bayes

Socks example is too simplistic and abstract and irrelevant. No significant parallels to nlp NB that i can see.

Use a word count vector example. And then an embedding vectors example. Like for the summarizer.
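A word-count example along these lines might look like the following. It is a minimal multinomial naive Bayes with add-one smoothing on a made-up four-document corpus; `train_nb` and `predict_nb` are hypothetical helper names, and a real chapter example would more likely use a library implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns label counts, word counts, vocabulary."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def predict_nb(text, label_counts, word_counts, vocab):
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Laplace (add-one) smoothed log likelihood of the word given the label
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Made-up toy corpus.
docs = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting notes for the project", "ham"),
    ("please review the project notes", "ham"),
]
model = train_nb(docs)
label, scores = predict_nb("free prize for the project", *model)
print(label, {k: round(v, 2) for k, v in scores.items()})
```

Each word contributes its own log-likelihood term independently of the others, which is exactly the "naive" assumption, and it maps directly onto a count vector in a way the socks example does not.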

“Real world examples are often more complex.” Real NLP problems aren’t at all like binary 2d blob classifying. Plot axes aren’t labeled and don’t correspond to values that would ever be generated for a TF-IDF vector or count vector or embedding vector.

Blob classifier decision boundary: the “power of naive bayes” is NOT shown in the diagram. It is like saying a 1st grade count-to-10 test is a measure of the power of a math student… “More sophisticated transformer”, “Fascinating”

“Deep and comprehensive understanding… Profound knowledge” (NOT)

tsne section

T-SNE, applied directly, is O(N^2 * D), where D is dimensions and N is data points. So the only scalable implementations use PCA to initialize the starting locations of points in 3d space. So your explanation of dimension reduction techniques should explain that PCA is a rotation and then truncation of the high-D vectors that preserves the large scale structure of the high dimensional space: vectors that start out far from each other remain far from each other because PCA maximizes variance overall. T-SNE, however, destroys large scale structure (especially if PCA is not used to initialize the low-D vectors) but instead preserves local structure by warping local regions according to local structure in the distance between points. And PCA is O(N * D) and can be implemented in parallel and incrementally. And PCA is deterministic, whereas T-SNE applied directly without PCA will give you different results each time.

Sentiment analysis and topic analysis are not difficult or complex. They are impossible and not well-formed. 3rd party human readers will not label product review or financial phrase bank texts with consistent topics or sentiment labels. Any model that achieves high accuracy is only useful in reproducing the exact distribution of labels that the dataset reflects. I can’t think of any real world natural language processing application where it is useful to reproduce the labels of previously released articles and their self-assigned labels by readers or authors of those texts.

These toy problems do not justify or illustrate any concepts that are useful to users of transformers, which have capacity that will overfit to problems like this, becoming useless in the real world. This should just be stated up front: these examples are not useful in the real world but only illustrate how to use transformers and naive Bayes models. These are examples where lower capacity models will be more useful in the real world. You also want to point out the pitfalls of fine tuning: the increased specialization, overfitting and brittleness of such a model. It is always important to find the edge cases in your validation set (the examples a model got wrong or nearly wrong, to see why), and then craft additional “out of distribution” test examples that use words and concepts not used in the training or validation set so that you can determine how useful the model will be in the real world. And instead of claiming that low performance of BERT justifies a more complex approach, add the caveat that this problem is a toy problem and that the improved performance of higher capacity models is an illusion that would lead you astray in your hyperparameter tuning in the real world unless you are constantly evaluating the model’s performance in the real world with your ever changing user base and their fickle preferences, opinions, and language use.

Transformers solve much harder problems with much more diverse and realistic test sets and evaluation metrics, such as generating plausible answers to questions. Their capacity is detrimental when applied to simpler problems, poorly defined problems, or poor quality, low information datasets.

part 3 intro

“Stunning” generated text output. “Plausible”, useful, well-formed, but not stunning. Eliza was only stunning to people unfamiliar with how it worked, not experts in the field. And that shock has a short lifespan, limiting the lifetime of your book.

Ch 6

Temperature does not increase creativity. It just reduces parroting and causes the model to use less probable, less plausible, less “average” text more often. Beam search deals with the Donald Biden problem, where LLMs go off the rails at the beginning of a document if they make the wrong choice about a single 50/50 word, like the first name of “the president” in an answer, and the entire rest of the document may seem accurate and plausible but will be 100% false. And the metrics used in this book will underestimate this error rate because it’s one token out of hundreds.
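The effect of temperature is easy to show on the probabilities themselves. This is a minimal sketch with made-up tokens and logits; it only rescales the logits before the softmax, which is the usual definition of temperature, and says nothing about how any particular library wires it in.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a standard softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Paris", "Lyon", "banana"]
logits = [4.0, 2.0, 0.0]  # hypothetical next-token scores

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in zip(tokens, probs)})
```

Raising the temperature moves probability mass from the most likely token to the unlikely ones. The ranking of the tokens never changes, so nothing new is "created"; implausible choices are simply sampled more often.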

6.3.1 context sensitive embeddings

Great examples: “expiration date on that milk carton”. But there are simple word embedding techniques that can disambiguate the meaning of the word “date.” Contextual embeddings in transformers do much more than that. They combine the meanings of the words expiration and carton with the word date to refine the entire concept expressed by the sentence. The query vectors can be thought of like “is this an adjective modifier of this word date, a sentiment modifier, or a temporal modifier (tense)” or “is this other word a location where the date is printed or an object in the past that was associated with this date”. And the value vector says how much each of the dimensions of the embedding vector should be modified based on that relationship. The attention mechanism is what creates context aware embeddings, and they are much richer than merely disambiguated word senses.

The maze example would be better if there were traps or monsters in the maze so that having multiple “lives” or beams would be beneficial. Even better if the damage to your “score” in the maze is hidden, like catching a disease, so you don’t know it and can’t measure it until the end of the maze.

A language model example used by others is the prompt “President …”, resulting in two nearly equally probable next words, Donald and Joe. One of which is correct and the other is not. Two completely different sentences will develop, and this is where beam search would be better than merely running the model multiple times. The one with Joe might have a date that is recent and the one with Donald might have a date more than 4 years ago, so your algorithm could choose the correct one only if they were both generated. A change in tone or style or length of statement is not nearly as interesting as one in which there is a change in factual correctness. Other examples of where beam search shines are recipe generation and math word problems. If there is a crux (critical branch) in the logic of the token sequence, beam search will be more likely to generate the correct answer. This is why beam search was invented and where it is useful. The alternative is restarting the generation from the beginning, many times, and this will likely get caught in the same traps each time, producing a lower quality set of possible sequences for your algorithm to choose from.

It helps to show the likelihood/probability of each token and the cumulative likelihood of two or more equal length sequences, to show the traps of 50/50 choices the model makes along the way and how one branch can result in a very low probability sentence and another can result in a much higher (more correct) one. This is at the core of one’s understanding of how LLMs work and why they seem so smart when in reality (even with beam search) they are extremely “dumb”, because even when a complete sentence is evaluated, you the user must ground the model to tell it which sequence was the factually correct answer to your math or history question. Beam search results in a more confident LLM, and more “plausible” output, but not one significantly more likely to be correct for truly new tasks not in its training set.
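The trap described above can be reproduced with a toy next-token table. `TOY_MODEL` and its probabilities are made up to mimic a maze with a near 50/50 first step; this is a minimal beam search sketch, not any library's implementation, and a beam width of 1 is the same as greedy decoding.

```python
import math

# Hypothetical toy "language model": next-token probabilities given a prefix.
TOY_MODEL = {
    (): {"left": 0.51, "right": 0.49},
    ("left",): {"trap": 0.4, "pit": 0.3, "wall": 0.3},
    ("right",): {"exit": 0.9, "wall": 0.1},
}

def next_token_probs(prefix):
    return TOY_MODEL.get(tuple(prefix), {})

def beam_search(beam_width, max_len=2):
    """Keep the beam_width most probable partial sequences at each step."""
    beams = [((), 0.0)]  # (token sequence, cumulative log probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + (tok,), logp + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy_seq, greedy_logp = beam_search(beam_width=1)[0]
beam_seq, beam_logp = beam_search(beam_width=2)[0]

print(greedy_seq, round(math.exp(greedy_logp), 3))
print(beam_seq, round(math.exp(beam_logp), 3))
```

Greedy decoding commits to "left" because 0.51 beats 0.49 and is then stuck with low-probability continuations, while the second beam keeps "right" alive long enough to find the more probable full sequence. More probable still does not mean factually correct, which is the caveat raised above.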

topk sampling

Topk is not an alternative to beam search. It is a sampling approach that can be used within beam search or not. It’s just a distribution sampling approach. I think the reason why your example did not terminate at the end of the sentence within your 50 token limit is because the decoder ignored the EOS token because it was not likely to be among the topk limit you set for that particular prompt. To compare the two approaches fairly, you can configure a transformer to prefer an EOS by setting the “early_stopping” parameter, as you did for the beam search example but not the topk example.
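The guess about the missing end-of-sentence token can be illustrated with a toy distribution. The tokens and probabilities here are made up, and `top_k_filter` is a hypothetical helper that keeps the k most probable tokens and renormalizes, which is the general idea of top-k sampling rather than any specific library's code.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(probs, rng):
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-token distribution where <eos> is possible but unlikely.
probs = {"and": 0.40, "the": 0.30, "of": 0.20, "<eos>": 0.10}

filtered = top_k_filter(probs, k=3)
rng = random.Random(0)
draws = [sample(filtered, rng) for _ in range(1000)]

print(filtered)
print("<eos>" in draws)
```

Once `<eos>` falls outside the top k its probability is exactly zero, so no number of samples at that step can end the sentence. This is a property of the sampling step, independent of whether beam search is also in use.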

general notes

“Created” implies agency and is anthropomorphizing and misleading. “More varied” or “new” or “novel” can usually be used instead. Transformers are not creative, they are stochastic parrots with statistical properties, not intelligence, agency, or creativity.

https://arxiv.org/pdf/2312.02143 “Zero shot learning” is a misnomer. The training set is so large it’s very hard to design truly zero shot questions. When papers do use reasoning or math or time-critical fact questions that were created recently, openai models get close to 0% accuracy on the answers.

7.2 DPO

“DPO is particularly receptive to overfitting”

SHOULD BE: “DPO is susceptible to overfitting”

EVEN BETTER: “A reinforcement learning teacher model that uses the initial target model to evaluate itself will always be overfit to the original unreinforced model. It was not surprising that this approach enabled LLMs to achieve higher ratings by human evaluators more quickly than with PPO or TTPO. DPO does NOT result in a more robust or generalized or accurate language model (if correctness and intelligence are the accuracy metric rather than believability by untrained uninformed users).”

8. Multi modal transformers

Training an MM model on GPT-4 generated data does not deepen the reasoning ability of LLMs; it broadens the language and visual pattern recognition of the models. Tests of reasoning ability will still fail to show any significant improvement. If the reasoning task is not solvable by pattern recognition, no amount of imitation of GPT-4 will help. See the paper on competition level coding challenges and math word problems.

9 Scaling

Mentioning sharding and optimization of your model, and that the “delicate balance” between performance and accuracy “is important”, does not provide the reader with general insight or guidance. “We hope you’ve gained insight” is stated without providing that insight. I only noticed summarized and repeated claims made elsewhere in similar academic literature.

10 Ethics

debiasing does not improve the ethics of a bot

“You must measure the bias in whatever LLMs you use”, “You must determine if your model will make toxic comments”: These are not helpful. What do you do with your estimates of bias? It is not possible to debias a generative model. All text that generalizes based on knowledge of the world will contain a perspective and express an opinion/bias/stereotype from some perspective. The ethical challenge is to not cause harm to humans through unfair discrimination, that is, through an action that does not support their mental, physical, or financial well-being or social status.

How is ethical Transformer keyword filtering possible? If you can’t build an ethical llm you can’t build an ethical filter. An ethics algorithm is embedded in the business logic of whatever app employs transformers to make decisions. Rather than filtering the text generated, the other approaches mentioned in the book should be utilized. Toxic comment classification combined with reprompting or reinforcement learning to steer the model and the conversations towards more helpful and prosocial support of the users.

Your example (blocking the use of keywords such as “stupid”) is an anti-pattern and was clearly counterproductive in the example you provided. If the filter was used on an imposter syndrome coach (e.g. https://syndee.ai/), the first unfiltered text would be much more appropriate than the second. Blocking keywords takes them out of context and causes inappropriate reactions/utterances by the model. Toxic comments may be used by a chatbot or its conversation partners within quotes, or as examples of bad behavior of others, or to desensitize the user in therapy sessions, or as part of a fictional script or pretend scenario. Toxic keywords are not good indicators of toxic utterances. For example, one of the most profoundly helpful books for children dealing with bullying is “A Prayer for Owen Meany”, where toxic comments and prosocial responses are modeled by the Owen character. In Grice’s rules of cooperative conversations, the 2nd filtered example generated text for the “stupid” question in chapter 10 would be considered uncooperative because it is a non sequitur and does not reflect any active listening on the part of the generative model. It is almost passive-aggressive. It ignores and doesn’t answer the question.
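The anti-pattern takes only a few lines to reproduce. The blocklist and the three utterances below are made up for illustration, and `keyword_filter_flags` is a hypothetical stand-in for the kind of keyword filter the chapter describes.

```python
import re

BLOCKED_KEYWORDS = {"stupid"}  # hypothetical blocklist, mirroring the book's example

def keyword_filter_flags(text):
    """Return True if any blocked keyword appears, regardless of context."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & BLOCKED_KEYWORDS)

# Made-up utterances: one insult, one supportive reply, one insult with no keyword.
utterances = {
    "insult": "You are stupid.",
    "supportive quote": "Feeling 'stupid' after a mistake is common, and it does not mean you are.",
    "insult without keyword": "Nobody would miss you if you quit.",
}

results = {name: keyword_filter_flags(text) for name, text in utterances.items()}
for name, flagged in results.items():
    print(name, flagged)
```

The filter blocks the supportive reply that quotes the word and passes the hurtful message that avoids it, which is why a classifier over whole utterances, combined with reprompting, is the better fit.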

Regarding bias and discrimination, all models, datasets and humans are biased. You must choose and steer the bias of your bot towards a liberal democratic (humanitarian) bias or whatever other viewpoint your business and data scientists consider ethical for their application. Bias is inherent in the process of machine learning and human decision making, and certainly in stochastic parrot generative models trained with RLHF by humans who have biases. Bias is the intended result of generalization and regularization of an ML model. It’s only a harmful stereotype or bias when the conclusions drawn and the actions taken are untrue or unhelpful or unethical. How an LLM is used determines its ethics, not what words it uses or the degree to which it is biased. It should embody ethically biased behaviors (not words) such as kindness and trust and forgiveness and authenticity and humility and all the other prosocial qualities that we expect in an ethical and supportive conversation partner. ChatGPT is the opposite of these because of its high level objective function: create excitement and inspire confidence among users, rather than truthfulness and humility and helpfulness. It was trained with PPO and other RLHF feedback that rewards overfitting and overconfidence and lying (it’s not hallucination when it is an intended and expected behavior or statement). Lying is baked into the architecture and training of ChatGPT and most LLMs (those that do not employ fact checking and robust-NLP techniques such as those proposed by the MLCommons community and Meta on the dynabench.org platform).

AI Ethics

(see DAIR and Timnit Gebru or Melanie Mitchell publications) Objective functions and business logic determine the ethics of an application, ML model, or organization. Profit is contrary to ethics and creates unethical incentives. RLHF (as it is employed in most LLMs) is unethical because it prefers plausible (believable) text over factually correct text and logically reasonable text and prosocial responses to user queries. Lies are unethical. Hiding lies behind unjustified expressions of confidence is unethical. Lack of transparency is unethical: having no ability to introspect its own behavior and motivations (architecture, training sets, objective functions) is unethical. “Jailbreaking” should not be required for a chatbot to have an honest and transparent conversation with its users. Using an LLM to profit off the ignorance of others is unethical. Arrogance that leads users astray and pollutes the infosphere with misinformation is unethical. Taking jobs from humans doing more ethical work is unethical. Exploiting workers by hiding behind transformers in making unethical HR and management decisions is unethical. Using LLMs to avoid corporate ethical responsibilities and legal obligations is unethical. These unethical behaviors are the expected and intended result of the kind of unethical RLHF (PPO, DPO, TTPO) training that was used to build most big tech foundational models.

Having a cooperative conversation is a general intelligence (AGI) problem, and cannot be solved with an LLM. It must be solved with a hybrid system that combines logical reasoning with many different NLP technologies besides Transformers. Attempting to make the transformer itself more ethical has the unintended effect of making it less ethical and truthful and effective at what it is used for (generating plausible text). The objective function for RLHF and LLMs can be made to have a bias towards prosocial “behavior” and a helpful and humble “tone”, but the LLM itself is not ethical or unethical. Emphasis should be placed on building factually accurate LLMs that transparently reveal an accurate confidence level in all the factual statements they utter. If asked to solve a math word problem or logic problem, they should not claim any accuracy at all. It is known that they have 0% accuracy on any 0-shot logical puzzle or coding problem or math word problem that does not have an analogous problem in the training set. “Competition Level Problems are Effective LLM Evaluators” arxiv.org/2312.02143v3.pdf

One approach to Ethical AI (and LLMs) would be to use RLHF (TTPO) by trainers who are trained in cooperative ethical conversations and are knowledgeable in the facts and the ethical and legal content of the conversation. They cannot be incentivized by profit or monetary compensation. It must be organically created in interacting with the real world. You must find out the effects of your words on others of various cultures by interacting with people of those cultures and having them provide reinforcement learning training. The like button makes this difficult for the data scientist. More nuanced feedback from users is required. Online realtime psychology and sociology experiments are difficult to design, but your primary goal as an engineer is to make sure your objective functions are aligned with those of your average user and that you detect misalignment with the goals and long term wellbeing of individual users.

Overall

  • Use of “in this context” as filler throughout the book is distracting. Similarly “it is important”, “critical”, etc filler phrases.
  • Great overview of the tools without mastery of those tools or insights about how to use them well.
  • Whenever possible, share the logits (probabilities) for words generated (output sequences) rather than just the chosen words. This is especially useful in discussion of beam search, topK and topP.
  • Explain multilabel classification as soon as possible, because that’s what a transformer is. It outputs a distribution of possible tokens, not a single token.
  • Temperature, topK, topP and beam search selection of tokens are not part of the transformer itself but rather a part of the LLM system as a whole and can be used with any generative model.
  • Sentiment analysis is a trite and misunderstood example. Transformers are ill-suited for it. The positive/neutral/negative classification problem is a one dimensional regression problem or a binary classification problem. The 3 categories/labels are not independent and there is no firm consistent definition of those categories.
  • First discussion of transformers should use the translation problem because that’s what inspired and popularized transformers. It is what transformers are well-suited for, modeling language (NOT AI agents and chatbots, which require logical reasoning and planning ability that LLMs lack).
  • Later you can show how Transformers can be used for multidimensional, multilabel sentiment analysis, a more subtle and useful “understanding” of text. Transformers are unhelpful for binary sentiment analysis: extreme overcapacity and overfitting. Only 10 labeled examples of whatever sentiment dimension you want to measure are necessary with transformers. Even 0 or 1 shot sentiment analysis is possible with BERT encodings, or even word vector embeddings. Logistic regression on BERT embeddings will be more accurate, more precise, and have higher recall on any real world classification or sentiment analysis problem than a Transformer, if both architectures are implemented correctly.

Lance Payton Mobile Home Park: Dan handling legal side. Get tenants on board and not paying rent increase. Taking letter and turning it into a petition. Sending resources on organizing.